CodeGraph: Enhancing Graph Reasoning of LLMs with Code

Paper: arXiv:2402.00000
Authors: [Author List]
Published: [Year]

📄 Abstract

This paper introduces CodeGraph, a novel method that significantly improves the ability of Large Language Models (LLMs) to solve fundamental graph problems. Instead of asking the LLM to reason and calculate the answer in natural language, CodeGraph prompts the LLM to generate a Python program that solves the problem. This program is then executed by a reliable external interpreter to get the final, accurate answer.


🚧 The Problem with Existing Methods

Current approaches typically convert a graph's structure into a textual description and feed it to an LLM. The authors identify three major flaws with this method:

  • Arithmetic Errors: LLMs are surprisingly bad at precise calculations required for graph tasks, such as accurately counting nodes or edges.
  • Prompt Sensitivity: Performance is highly dependent on how the graph is described. A slight change in wording can lead to a completely different answer.
  • Structural Bias: LLMs tend to be swayed by the shape of the graph (e.g., handling cycle checks well on dense graphs but failing on sparse ones such as simple paths).

🛠️ How CodeGraph Works

CodeGraph reframes the task to play to the LLM's strengths (understanding context and generating code) while avoiding its weaknesses (precise calculation). The process works as follows:

  1. Prompting with Exemplars: The LLM is given a few examples (a "few-shot" setting) that show how a graph problem is translated into a complete Python script.
  2. Code Generation: For a new graph problem, the LLM is instructed to:
    • Translate the textual description of the graph into Python data structures (e.g., a list of tuples for the edges).
    • Generate a Python function to perform the required reasoning.
  3. Execution: The generated code is run by a standard Python interpreter, which carries out the computation exactly and returns the final answer (a sketch of such a script appears below).
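
To make the code-generation step concrete, here is a minimal sketch of the kind of script an LLM might emit for a node-degree question. The graph, the node labels, and the task itself are illustrative assumptions, not exemplars taken from the paper:

```python
# Illustrative graph only: the edge list would be parsed from a textual
# description such as "There is an edge between node 0 and node 1, ...".
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

def node_degree(edges, target):
    """Count how many edges touch the target node."""
    return sum(1 for u, v in edges if target in (u, v))

# The final answer is computed by the interpreter, not the LLM.
print(node_degree(edges, target=2))  # -> 3
```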

This method effectively separates the reasoning (generating the logic as code) from the computation (running the code).
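
The execution step needs nothing beyond a standard interpreter. As a rough sketch of such a harness (the function name and setup here are assumptions, not the paper's actual implementation), the host program can write the generated script to a temporary file, run it in a fresh Python process, and read the answer from stdout:

```python
import subprocess
import sys
import tempfile

def run_generated_code(code: str) -> str:
    """Execute LLM-generated Python in a fresh interpreter and return its stdout.

    Hypothetical harness for illustration; a real deployment would sandbox
    the subprocess, since the code comes from a model.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.strip()

# Example: run_generated_code("print(1 + 1)") returns "2".
```

Running model-generated code is the one operational risk of this design, which is why a real deployment would isolate the subprocess rather than execute it directly.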


📊 Key Findings and Results

The authors tested CodeGraph on the GraphQA benchmark across six different graph tasks using various LLMs (including GPT-3.5 Turbo, Llama3-70B, and Mixtral models).

  • Large Performance Boost: CodeGraph outperformed baseline methods, improving accuracy by 1.3% to 58.6% depending on the task and model, with the biggest gains on arithmetic-heavy tasks.
  • Robust and Consistent: The method proved highly robust, performing consistently well across different graph descriptions and structures.
  • Generalizable: CodeGraph is effective across different LLMs. Even smaller models showed significant improvement.

✅ Conclusion

The paper successfully demonstrates that using LLMs as code generators rather than calculators is a far more effective, reliable, and interpretable approach for graph reasoning. By delegating computation to an external interpreter, CodeGraph overcomes the inherent limitations of LLMs in arithmetic and logical precision.